Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text

Abstract

The search accuracy achieved in a PDF image-plus-hidden-text (PDF-IT) document depends upon the accuracy of the optical character recognition (OCR) process that produced the searchable hidden text layer. In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine. This paper describes a project to replace an inadequate hidden textual layer of a PDF-IT file with a more accurate hidden layer produced from a 'truth text'. The alignment of the truth text with the image is guided by using OCR-provided page-image co-ordinates, for those glyphs that are correctly recognised, as a set of fixed location points between which other truth-text words can be inserted and aligned with blurred glyphs in the image. Results are presented to show the much enhanced searchability of this new file when compared to that of the original file, which had an OCR-produced hidden layer with no truth-text enhancement.
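The anchoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual alignment procedure, so everything here is an assumption made for illustration: the function name `align_truth_to_ocr`, the `(text, x0, x1)` word representation for a single text line, the use of Python's `difflib.SequenceMatcher` to find correctly recognised words, and the simple rule of spreading unmatched truth-text words across the gap between two anchors in proportion to their character counts.

```python
from difflib import SequenceMatcher


def align_truth_to_ocr(ocr_words, truth_words):
    """Place truth-text words on one text line using OCR words as anchors.

    ocr_words   -- non-empty list of (text, x0, x1) tuples from the OCR layer
                   (hypothetical format; one line of the page image)
    truth_words -- list of non-empty strings from the truth text, same line
    Returns a list of (truth_word, x0, x1) tuples.
    """
    ocr_texts = [w[0] for w in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=truth_words, autojunk=False)
    placed = [None] * len(truth_words)

    # Step 1: words the OCR engine got right become fixed location points.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, x0, x1 = ocr_words[block.a + k]
            placed[block.b + k] = (truth_words[block.b + k], x0, x1)

    # Step 2: fill each run of unplaced truth words between two anchors.
    line_x0 = min(w[1] for w in ocr_words)
    line_x1 = max(w[2] for w in ocr_words)
    i = 0
    while i < len(placed):
        if placed[i] is not None:
            i += 1
            continue
        j = i
        while j < len(placed) and placed[j] is None:
            j += 1
        # Gap boundaries: neighbouring anchors, or the line edges.
        left = placed[i - 1][2] if i > 0 else line_x0
        right = placed[j][1] if j < len(placed) else line_x1
        gap_words = truth_words[i:j]
        total_chars = sum(len(w) for w in gap_words)
        cursor = left
        for k, word in enumerate(gap_words):
            # Assumed rule: width proportional to character count.
            width = (right - left) * len(word) / total_chars
            placed[i + k] = (word, cursor, cursor + width)
            cursor += width
        i = j
    return placed


# Example: two blurred words were misread by the (hypothetical) OCR output.
ocr_line = [("The", 0, 30), ("qu1ck", 40, 90), ("brovvn", 100, 160), ("fox", 170, 200)]
truth_line = ["The", "quick", "brown", "fox"]
for entry in align_truth_to_ocr(ocr_line, truth_line):
    print(entry)
```

In this sketch "The" and "fox" keep their OCR co-ordinates, while "quick" and "brown" are spread across the horizontal span between them. A real implementation would also need vertical co-ordinates, inter-word spacing, and handling of line breaks, none of which are modelled here.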